This PR adds a new Wazuh integration for Wazuh decoder rule generation tool #79
Merged
…s and auto-enable split mode for CEF logs
… formats for more reliable extraction
…ue logs instead of truncating prefixes
…source pattern learning
…ders from all logs
…tterns and full preceding words instead of truncating prefixes
…d generalize them to \d+ to prevent brittle anchors
Pull request overview
This PR introduces a new wazuh_decoder_rule_tool integration: a FastAPI-based UI/API for analyzing pasted logs, optionally validating them via wazuh-logtest, and generating Wazuh decoder/rule XML. It also adds an “enhanced” ML decoder-similarity approach (TF‑IDF + SBERT) plus scripts/datasets to train a custom similarity model from Wazuh ruleset test data.
Changes:
- Add the FastAPI app’s HTML/JS/CSS frontend and supporting backend utilities for decoder/rule generation workflows.
- Add ML enhancements: ensemble similarity model wrapper, dataset builder + training script, and accompanying tests/docs.
- Add local datasets and TLS artifacts for local HTTPS testing (currently including private keys).
Reviewed changes
Copilot reviewed 21 out of 26 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| integrations/wazuh_decoder_rule_tool/tests/test_ml_enhanced.py | Adds unit tests for enhanced ML similarity components. |
| integrations/wazuh_decoder_rule_tool/tests/test_integration.py | Adds a basic integration test for enhanced ML model loading. |
| integrations/wazuh_decoder_rule_tool/scripts/train_similarity.py | Adds SBERT contrastive training script for decoder similarity. |
| integrations/wazuh_decoder_rule_tool/scripts/build_dataset.py | Adds script to build training/validation datasets from Wazuh rules-testing suites + feedback. |
| integrations/wazuh_decoder_rule_tool/requirements.txt | Adds Python dependencies for running the tool (FastAPI/Uvicorn/ML libs). |
| integrations/wazuh_decoder_rule_tool/README.md | Documents local HTTPS run instructions, remote VM mode, and ML training workflow. |
| integrations/wazuh_decoder_rule_tool/ML_ENHANCEMENT_SUMMARY.md | Documents ML feature-engineering + ensemble approach and future tuning ideas. |
| integrations/wazuh_decoder_rule_tool/key.pem | Adds a private key file (should not be committed). |
| integrations/wazuh_decoder_rule_tool/generated/decoders/local_myapp_decoder_20260307094900.xml | Adds generated decoder XML output artifact. |
| integrations/wazuh_decoder_rule_tool/generated/decoders/local_myapp_decoder_20260307094544.xml | Adds generated decoder XML output artifact (duplicate-style). |
| integrations/wazuh_decoder_rule_tool/data/datasets/val.jsonl | Adds validation dataset records for ML training. |
| integrations/wazuh_decoder_rule_tool/data/datasets/feedback.jsonl | Adds feedback dataset examples used for training/tuning. |
| integrations/wazuh_decoder_rule_tool/data/datasets/feedback_rejections.jsonl | Adds rejected feedback examples for analysis/training workflows. |
| integrations/wazuh_decoder_rule_tool/certs/localhost.key | Adds a private TLS key for local HTTPS (should not be committed). |
| integrations/wazuh_decoder_rule_tool/certs/localhost.crt | Adds a self-signed TLS certificate for local HTTPS. |
| integrations/wazuh_decoder_rule_tool/cert.pem | Adds a certificate artifact for local HTTPS usage. |
| integrations/wazuh_decoder_rule_tool/app/wazuh_logtest.py | Adds a helper to run wazuh-logtest via SSH (currently hardcoded/inconsistent). |
| integrations/wazuh_decoder_rule_tool/app/templates/index.html | Adds the single-page HTML UI for the tool. |
| integrations/wazuh_decoder_rule_tool/app/static/styles.css | Adds styling for the UI. |
| integrations/wazuh_decoder_rule_tool/app/static/app.js | Adds UI logic for navigation, generate/test flows, ML status, AI generation, feedback, history. |
| integrations/wazuh_decoder_rule_tool/app/decoder_ml.py | Adds baseline TF‑IDF similarity models + parsing utilities for decoders/rules. |
| integrations/wazuh_decoder_rule_tool/app/decoder_ml_enhanced.py | Adds enhanced feature engineering + ensemble similarity model + compatibility wrapper. |
| integrations/wazuh_decoder_rule_tool/.gitignore | Adds ignores for venv/cache/model/repo directories. |
```js
function toggleConditionsRow() {
  const req = document.getElementById('ruleRequirement').value.trim();
  document.getElementById('ruleFieldConditionsRow').style.display = req ? 'flex' : 'none';
  document.getElementById('ruleMatchConditionsRow').style.display = req ? 'flex' : 'none';
```
Comment on lines +18 to +35
```python
    try:
        # This might fail if no Wazuh repo is available, but that's OK for this test
        model = ensure_ml_model_enhanced(force_refresh=False, use_ensemble=True)
        # If we get here without exception, the function works
        assert model is not None or model is None  # Either is fine
        print("✓ ensure_ml_model_enhanced executed successfully")
        return True
    except Exception as e:
        print(f"✗ ensure_ml_model_enhanced failed: {e}")
        return False


if __name__ == "__main__":
    success = test_ensure_ml_model_enhanced()
    if success:
        print("Integration test passed!")
    else:
        print("Integration test failed!")
```
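The check in this hunk cannot fail: `assert model is not None or model is None` is always true, and a caught exception only changes what the `__main__` block prints. A minimal sketch of a stricter variant follows, assuming a loader that takes the same keyword arguments as `ensure_ml_model_enhanced`. The helper names `check_model_loader` and `exit_code` and the stand-in loaders are hypothetical and not part of the PR.

```python
import sys


def check_model_loader(loader):
    """Run a model loader and classify the outcome.

    `loader` is any callable accepting force_refresh/use_ensemble keywords
    (an assumption modelled on ensure_ml_model_enhanced).
    Returns "loaded", "unavailable" (loader returned None) or "error".
    """
    try:
        model = loader(force_refresh=False, use_ensemble=True)
    except Exception as exc:  # report the failure instead of hiding it
        print(f"loader failed: {exc}", file=sys.stderr)
        return "error"
    return "loaded" if model is not None else "unavailable"


def exit_code(outcome):
    """Map an outcome to a process exit code so CI notices failures."""
    return 1 if outcome == "error" else 0


# Hypothetical stand-ins for the real loader, used only for illustration.
def fake_loader_ok(force_refresh, use_ensemble):
    return object()


def fake_loader_missing_repo(force_refresh, use_ensemble):
    return None


def fake_loader_broken(force_refresh, use_ensemble):
    raise RuntimeError("no Wazuh repo available")
```

A script entry point could pass `exit_code(check_model_loader(...))` to `sys.exit` so that a broken loader produces a non-zero exit status, while a missing Wazuh repo is still tolerated.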
```python
            parts.extend([self.prematch] * int(prematch_weight))
        if self.regex:
            # Extract meaningful tokens from regex
            regex_tokens = re.findall(r'\[\\w\+\\]|\\\\d\+|\\\\S\+|\\\\w\+', self.regex)
```
Comment on lines +38 to +51
```python

        parts = []
        if self.name:
            parts.extend([self.name] * int(name_weight))
        if self.program_name:
            parts.extend([self.program_name] * int(program_weight))
        if self.prematch:
            parts.extend([self.prematch] * int(prematch_weight))
        if self.regex:
            # Extract meaningful tokens from regex
            regex_tokens = re.findall(r'\[\\w\+\\]|\\\\d\+|\\\\S\+|\\\\w\+', self.regex)
            parts.extend(regex_tokens * int(regex_weight))
        if self.order:
            parts.extend(self.order * int(order_weight))
```
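The `re.findall` pattern in this hunk escapes each backslash twice, so `\\\\d\+` looks for two literal backslashes before `d+`. A decoder regex normally carries a single backslash (`\d+`), so the pattern would find nothing in it. The sketch below compares the quoted pattern with a simpler one on an illustrative decoder regex. `SINGLE_ESCAPED` and the sample string are assumptions made for this comparison, not the fix that was merged.

```python
import re

# Pattern quoted from the reviewed hunk: every backslash is doubled.
OVER_ESCAPED = r'\[\\w\+\\]|\\\\d\+|\\\\S\+|\\\\w\+'

# Hypothetical alternative: one literal backslash, a class letter, a literal '+'.
SINGLE_ESCAPED = r'\\[dSw]\+'


def regex_tokens(pattern, decoder_regex):
    """Return the OS_Regex class tokens that `pattern` finds in `decoder_regex`."""
    return re.findall(pattern, decoder_regex)


# Illustrative decoder regex with single backslashes, as it appears in decoder XML.
sample = r'(\d+-\d+) \S+ from (\w+)'
print(regex_tokens(OVER_ESCAPED, sample))    # no tokens found in this sample
print(regex_tokens(SINGLE_ESCAPED, sample))  # finds \d+, \d+, \S+, \w+
```

With an empty token list, the regex contributes nothing to the weighted text, whatever `regex_weight` is set to.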
Comment on lines +1 to +5
| -----BEGIN PRIVATE KEY----- | ||
| MIIJQgIBADANBgkqhkiG9w0BAQEFAASCCSwwggkoAgEAAoICAQDeCJuheTkfwUSK | ||
| shHW/6XR28sohDtaA+BgE5VQhA/dO0A0OD4Y+FHFvwqDZg4j74mZ1s4BBxdercSO | ||
| l1NXmfTJvH0WhY09vSyS3g4N/T1unrtTFUTrC3Dc5ovLAxAUe2AHLGhQcXGWRbTq | ||
| pEL1KEoYG89DSisTjSBOcoM3dE8fnU2Gc7YCvLUh8IpIaYLr0GOiQumAGhxIyWGq |
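The overview table marks this file as one that should not be committed. A minimal sketch of a check that could flag such files before review follows, assuming it is enough to scan file contents for PEM private-key headers. The names `contains_private_key` and `flag_files` are hypothetical.

```python
import re

# Matches PEM headers such as "-----BEGIN PRIVATE KEY-----" or
# "-----BEGIN RSA PRIVATE KEY-----".
PEM_PRIVATE_KEY = re.compile(r"-----BEGIN (?:[A-Z0-9]+ )*PRIVATE KEY-----")


def contains_private_key(text):
    """Return True if the text holds a PEM private-key header."""
    return PEM_PRIVATE_KEY.search(text) is not None


def flag_files(files):
    """Given a {path: text} mapping, return the sorted paths that hold private keys."""
    return sorted(path for path, text in files.items() if contains_private_key(text))
```

Certificates (`-----BEGIN CERTIFICATE-----`) are deliberately not flagged, since only the key half is secret.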
Comment on lines +14 to +18
```
# Cache directories and ML models
data/models/
data/wazuh_repo/
data/wazuh_ruleset_repo/

```
| @@ -0,0 +1,3 @@ | |||
| {"log":"03-17 16:13:38.811 1702 2395 D WindowManager: printFreezingDisplayLogsopening app wtoken = AppWindowToken{9f4ef63 token=Token{a64f992 ActivityRecord{de9231d u0 com.tencent.qt.qtl/.activity.info.NewsDetailXmlActivity t761}}}, allDrawn= false, startingDisplayed = false, startingMoved = false, isRelaunching = false","decoder":{"name":"myapp-event","parent":"myapp","prematch":"WindowManager:","regex":"(\\d+-\\d+ \\d+:\\d+:\\d+.\\d+) \\d+ \\d+ \\S WindowManager: \\S+ \\S+ wtoken = (\\.+) token=(\\.+), allDrawn= (\\S+)","order":["logtime","wtoken","token","allDrawn"],"source_file":"feedback/windowmanager.json"}} | |||
| {"log":"20171223-22:15:33:144|Step_SPUtils|30002312| getTodayTotalDetailSteps = 1514038440000##7013##548365##8661##12836##27176966","decoder":{"name":"myapp-event","parent":"myapp","prematch":"Step_SPUtils","regex":"(\\.+)\\|Step_SPUtils\\|30002312\\| getTodayTotalDetailSteps = (\\.+)","order":["logtime","getTodayTotalDetailSteps"],"source_file":"feedback/pipemetric.json"}} | |||
| {"timestamp": "2026-05-16T08:56:11.647689Z", "approved": true, "log": "May 16 14:22:31 plc-gateway01 scada-engine[2241]: ALERT Modbus unauthorized write request detected from 10.10.50.24 function_code=0x10 register=40123", "extract_fields": ["srcip", "funtion_code"], "notes": "", "decoder": {"name": "myapp-event", "parent": "myapp", "prematch": "scada-engine", "regex": "ALERT\\s+Modbus\\s+unauthorized\\s+write\\s+request\\s+detected\\s+from\\s+(\\d+.\\d+.\\d+.\\d+)\\s+function_code=(\\d+x\\d+)\\s+register=\\d+", "order": ["srcip", "function_code"], "source_file": "feedback/myapp.json"}, "target_text": "myapp-event myapp scada-engine alert\\s+modbus\\s+unauthorized\\s+write\\s+request\\s+detected\\s+from\\s+(\\d+.\\d+.\\d+.\\d+)\\s+function_code=(\\d+x\\d+)\\s+register=\\d+ srcip function_code feedback/myapp.json"} | |||
| {"timestamp": "2026-04-29T05:52:13.354712Z", "approved": false, "app_name": "myapp", "log": "[2026-04-29T04:29:06,056][INFO ][o.o.s.s.c.FlintStreamingJobHouseKeeperTask] [node-1] Starting housekeeping task for auto refresh streaming jobs.", "extract_fields": ["logtime", "loglevel", "message"], "notes": "[(\\d+-\\d+-\\S+:\\d+:\\d+,\\d+)][(\\S+)\\s][\\.+] [\\S+] (\\.+)"} | ||
| {"timestamp": "2026-04-29T08:50:41.323760Z", "approved": false, "app_name": "myapp", "log": "[2026-04-29T04:29:06,056][INFO ][o.o.s.s.c.FlintStreamingJobHouseKeeperTask] [node-1] Starting housekeeping task for auto refresh streaming jobs.", "extract_fields": [], "notes": "It should be corrected like this"} | ||
| {"timestamp": "2026-04-29T08:50:41.368350Z", "approved": false, "app_name": "myapp", "log": "[2026-04-29T04:29:06,056][INFO ][o.o.s.s.c.FlintStreamingJobHouseKeeperTask] [node-1] Starting housekeeping task for auto refresh streaming jobs.", "extract_fields": [], "notes": "It should be corrected like this"} | ||
| {"timestamp": "2026-05-16T08:56:23.312599Z", "approved": false, "app_name": "myapp", "log": "May 16 14:22:31 plc-gateway01 scada-engine[2241]: ALERT Modbus unauthorized write request detected from 10.10.50.24 function_code=0x10 register=40123", "extract_fields": ["srcip", "funtion_code"], "notes": ""} |
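The approved feedback record in this hunk lists `funtion_code` in `extract_fields`, while its decoder `order` uses `function_code`. A minimal sketch of a consistency check for such records follows, assuming the `extract_fields` and `decoder.order` keys shown in the hunk. The helper names and the shortened sample records are hypothetical.

```python
import json


def field_mismatches(record):
    """Return requested fields that are missing from the decoder's order list.

    Records without a decoder (for example rejections) yield an empty list.
    """
    decoder = record.get("decoder")
    if not decoder:
        return []
    order = set(decoder.get("order", []))
    return [f for f in record.get("extract_fields", []) if f not in order]


def check_jsonl(text):
    """Map 1-based line numbers to mismatched fields for a JSONL string."""
    problems = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        missing = field_mismatches(json.loads(line))
        if missing:
            problems[lineno] = missing
    return problems


# Shortened, illustrative records modelled on the hunk above.
sample = "\n".join([
    json.dumps({"extract_fields": ["srcip", "funtion_code"],
                "decoder": {"order": ["srcip", "function_code"]}}),
    json.dumps({"extract_fields": ["logtime"], "approved": False}),
])
print(check_jsonl(sample))
```

Running a check like this when building the dataset would surface typos in user-supplied field names before they reach training.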
Comment on lines +71 to +78
For this workspace, the app now defaults to:

```bash
WAZUH_SSH_HOST=192.168.56.10
WAZUH_SSH_PORT=22
WAZUH_SSH_USER=vagrant
WAZUH_SSH_PASSWORD=vagrant
```
Comment on lines +4 to +18
```python
WAZUH_HOST = "127.0.0.1"
WAZUH_PORT = "2222"
WAZUH_USER = "vagrant"

# read from environment variable
WAZUH_LOGTEST = os.getenv("WAZUH_LOGTEST_PATH", "/var/ossec/bin/wazuh-logtest")


def run_logtest(log_line):
    cmd = [
        "ssh",
        "-p", WAZUH_PORT,
        f"{WAZUH_USER}@{WAZUH_HOST}",
        f"sudo {WAZUH_LOGTEST}"
    ]
```
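The hunk hardcodes `127.0.0.1:2222`, while the README excerpt documents `192.168.56.10:22` through `WAZUH_SSH_*` variables. A minimal sketch of reading one set of environment variables with fallbacks follows. The variable names come from the README excerpt and the `WAZUH_LOGTEST_PATH` lookup in the hunk; the fallback values and the helper name `build_logtest_cmd` are assumptions.

```python
import os
import shlex


def build_logtest_cmd(env=None):
    """Build the ssh argv for a remote wazuh-logtest run.

    The fallback values are assumptions for a local test VM, not
    documented defaults of the tool.
    """
    env = os.environ if env is None else env
    host = env.get("WAZUH_SSH_HOST", "127.0.0.1")
    port = env.get("WAZUH_SSH_PORT", "22")
    user = env.get("WAZUH_SSH_USER", "vagrant")
    logtest = env.get("WAZUH_LOGTEST_PATH", "/var/ossec/bin/wazuh-logtest")
    # Quote the remote path so spaces or shell metacharacters stay one argument.
    return ["ssh", "-p", str(port), f"{user}@{host}", f"sudo {shlex.quote(logtest)}"]
```

Passing a plain dictionary as `env` keeps the function testable without touching the real environment.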
- Hybrid AI generation: programmatic base XML (guaranteed correct) + AI review for regex improvement
- wazuh-logtest always checked before AI generation to determine parent strategy
- Parent decoder uses <program_name> when log has a decoded program name
- Fields already decoded by built-in decoders are skipped automatically
- AI prompt refocused on reviewing/improving regex patterns instead of writing XML from scratch
- Git subprocess calls now have timeouts to prevent startup hangs
- Updated README with AI provider setup and hybrid approach documentation

…ation
- Removed Decoder Generator and Rule Generator sections from HTML
- Moved input fields (appName, logsInput, extractFields, etc.) into AI view
- Removed 'Generate Decoder' and 'Generate Rule' sidebar nav items
- Made 'AI Generate' the default active view
- Cleaned up app.js: removed unused functions (showAnalysis, showXml, syncFeedback, readRulePayload, rule conditions UI, old button handlers)
- Updated history loading and test function to work without decoder view

- Added POST /api/install endpoint to write decoder/rule XML to Wazuh's custom decoders/rules directories (SSH or local)
- Added POST /api/uninstall endpoint to remove installed files
- Added POST /api/logtest/raw endpoint for running wazuh-logtest with arbitrary log samples and returning raw output + parsed fields
- Redesigned Test view with three cards: Installed Decoder (install/uninstall), Test Logs (editable sample input), and wazuh-logtest Output (raw stdout + parsed fields table)
- Added state management storing installed file paths in localStorage
- AI-generated XML is now persisted in JS so it can be installed from the Test view without re-running AI generation

…ailure
- Add generation_mode (auto/decoder_only/rule_only/both) to AI request
- Add validate_with_logtest flag and /api/ai/generate-validated endpoint
- Add _collect_ai_response, _extract_xml_from_ai_response helpers
- Add _validate_ai_decoder_with_logtest for auto-install+test validation
- Refactor _build_ai_prompt: shorter config block, concise ML/logtest context
- Add system prompt for Ollama (system+user roles), fix URL path
- Lower default temperature to 0.05 for more deterministic output
- Default model changed to wazuh-decoder
- UI: generation mode dropdown, validate checkbox, Generate & Validate button
- UI: show validation badge & details in AI output section
- UI: hide rule section when generation_mode=decoder_only

…ndpoint and automate rule group/static field sanitization
…coring, and sigmoid calibration
- Add log-type detection (_detect_log_type) with type-based boosting to bias results toward relevant decoder families (JSON, Windows, syslog, etc.)
- Add regex token overlap scoring (_regex_overlap_score) to boost patterns whose OS_Regex tokens match query log literals
- Add sigmoid confidence calibration for well-calibrated probabilities in [0,1]
- Tune ensemble weights: TF-IDF 0.3, SBERT 0.7 (semantic model is stronger for unseen formats)
- Raise minimum confidence gate to 0.15 to avoid low-confidence noise
- Add fine-tuned SBERT checkpoint loading with graceful fallback
- Enhance tokenizer to preserve more OS_Regex character classes

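The weighting and gating described in this commit can be sketched as follows. The 0.3/0.7 weights and the 0.15 gate come from the commit message. The sigmoid steepness and midpoint, the function names, and applying the gate to the calibrated score are assumptions, since the actual calibration parameters are not shown here.

```python
import math


def ensemble_score(tfidf_sim, sbert_sim, w_tfidf=0.3, w_sbert=0.7):
    """Weighted combination of the two similarity scores."""
    return w_tfidf * tfidf_sim + w_sbert * sbert_sim


def calibrate(raw, steepness=10.0, midpoint=0.5):
    """Map a raw score into (0, 1) with a sigmoid (parameters are assumed)."""
    return 1.0 / (1.0 + math.exp(-steepness * (raw - midpoint)))


def rank_decoders(candidates, min_confidence=0.15):
    """Rank (name, tfidf_sim, sbert_sim) tuples, dropping low-confidence ones."""
    scored = [(name, calibrate(ensemble_score(t, s))) for name, t, s in candidates]
    kept = [(name, conf) for name, conf in scored if conf >= min_confidence]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

For example, `rank_decoders([("sshd", 0.9, 0.9), ("json", 0.1, 0.1)])` keeps only `sshd`, because the second candidate calibrates to a confidence below the gate.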
… Modelfile
- Lower temperature (0.05→0.02) and top_p (0.85→0.80) for more deterministic output
- Increase repeat_penalty (1.15→1.20) and lower top_k (20→15) to reduce repetition
- Add self-validation checklist to catch common errors before output
- Add JSON log decoder and DHCP/MAC address examples
- Fix sshd example to use same decoder name for multiple children
- Add instruction: 'No text before or after' the XML block

…lization
- Default OLLAMA_BASE_URL to http://localhost:11434/v1 so it works without env vars
- Normalize /v1 suffix to prevent double-/v1 404 errors in URL construction
- Add 60s timeout to streaming client with retry on ReadTimeout (up to 3 attempts)
- Add decoder rule: multiple child decoders must use exact same decoder name
- Fix IP regex guidance: do not escape dots in \d+.\d+.\d+.\d+
- Update top_k to 15 and repeat_penalty to 1.20 to match Modelfile tuning
- Improve error messages for network/server issues

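The `/v1` normalization mentioned in this commit can be sketched as below. The function names are hypothetical, and the `/chat/completions` path assumes an OpenAI-compatible endpoint, which the commit message does not confirm.

```python
def normalize_base_url(url, suffix="/v1"):
    """Return `url` with exactly one trailing `suffix` and no trailing slash."""
    base = url.strip().rstrip("/")
    # Strip any number of existing suffixes so "/v1/v1" cannot survive.
    while base.endswith(suffix):
        base = base[: -len(suffix)].rstrip("/")
    return base + suffix


def chat_endpoint(base_url):
    # Assumption: the backend exposes an OpenAI-compatible chat path.
    return normalize_base_url(base_url) + "/chat/completions"
```

Whether the configured value is `http://localhost:11434`, `http://localhost:11434/v1` or `http://localhost:11434/v1/`, the result is the same base URL, which avoids the double-`/v1` 404 described in the commit.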
…to dataset builder
- Add load_rejection_records(): convert rejection notes with regex corrections into positive training pairs
- Add augment_with_dropout(): create robustness variants by randomly masking log tokens (15% prob)
- Rejection corrections teach SBERT to distinguish correct from broken regex patterns
- Dropout augmentation teaches model that partial log lines still map to same decoder
- Add structured logging of record counts throughout pipeline

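Token dropout with a 15% masking probability, as described for `augment_with_dropout()`, might look like the sketch below. The signature, the whitespace tokenization, the `<mask>` placeholder and the seeding are assumptions. The PR's implementation is not shown in this excerpt and may delete tokens rather than replace them.

```python
import random


def mask_tokens(log_line, drop_prob=0.15, mask="<mask>", rng=None):
    """Replace each whitespace-separated token with `mask` with probability drop_prob."""
    rng = rng or random.Random()
    tokens = log_line.split()
    return " ".join(mask if rng.random() < drop_prob else tok for tok in tokens)


def augment(log_line, variants=3, seed=0, drop_prob=0.15):
    """Produce several masked variants of one log line, reproducibly."""
    rng = random.Random(seed)
    return [mask_tokens(log_line, drop_prob, rng=rng) for _ in range(variants)]
```

Each variant keeps the token count of the original line, so it can be paired with the same decoder label as the unmasked log.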
…nting to SBERT training
- 5 epochs with best-checkpoint saving (by validation AUC)
- Larger batch size (64 configurable) for better in-batch negatives with MultipleNegativesRankingLoss
- Hard-negative augmentation: pair logs with categorically distinct decoders (30% ratio)
- Token dropout data augmentation for robustness on partial input
- Early stopping with patience=2 epochs
- Add binary evaluator with both positive and negative pairs for AUC measurement
- Configurable training device (default CPU to avoid MPS OOM with Ollama)
- Copy best checkpoint to 'final' directory for easy model loading

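Best-checkpoint selection with early stopping (patience of 2) can be illustrated on a list of per-epoch validation AUCs. This is a simulation under assumed names; the real script trains a model each epoch and saves a checkpoint where the comment indicates.

```python
def train_with_early_stopping(val_aucs, patience=2):
    """Simulate checkpoint selection over per-epoch validation AUCs.

    Returns (best_epoch, last_epoch_run), both 0-based; (-1, -1) for no epochs.
    """
    best_auc = float("-inf")
    best_epoch = -1
    epochs_without_gain = 0
    last_epoch = -1
    for epoch, auc in enumerate(val_aucs):
        last_epoch = epoch
        if auc > best_auc:
            # A real training loop would save the checkpoint here.
            best_auc, best_epoch = auc, epoch
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break
    return best_epoch, last_epoch
```

With AUCs of `[0.70, 0.75, 0.74, 0.73, 0.80]`, training stops after the fourth epoch and keeps the second epoch's checkpoint, so the late improvement to 0.80 is never reached. That trade-off is inherent to a small patience value.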
The sidebar defaulted to AI Generate as active, but the corresponding #view-ai div was missing the 'active' class, so CSS display:none kept the entire AI generation page blank on initial load.
…egex instruction
The AI model consistently escapes dots (\.) in regex patterns because it is trained on PCRE where this is correct. Wazuh OS_Regex treats '.' as a literal character, so \. is wrong syntax. Fix:
- Add _sanitize_decoder_xml_osregex() that strips \. → . in generated XML
- Apply it in _extract_xml_from_ai_response and the final return
- Strengthen the Modelfile and prompt instruction with WRONG/RIGHT examples to make the rule impossible to miss

- Fix sanitization regex: r'\.' was matching any char after backslash (breaking \d, \w, etc.). Use r'\\.' to match only backslash + literal dot.
- Add Example 7 to Modelfile showing correct TrafficLog IP extraction with unescaped dots in OS_Regex
- Strengthen prompt WRONG/RIGHT examples for IP regex

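The patterns quoted in this commit message may have lost backslashes in rendering, so it is worth checking how the candidates behave in Python's `re` syntax: `r'\.'` matches a literal dot, `r'\\.'` matches a backslash followed by any character, and `r'\\\.'` matches a backslash followed by a literal dot. A short demonstration follows; the sample string and helper name are assumptions, and the PR's actual sanitizer is not shown in this excerpt.

```python
import re

# Illustrative OS_Regex fragment with an escaped dot between digit classes.
sample = r'(\d+\.\d+) \w+'


def strip_with(pattern, text):
    """Replace every match of `pattern` with a plain dot."""
    return re.sub(pattern, '.', text)


print(strip_with(r'\.', sample))    # literal dots only: the backslash before the dot stays
print(strip_with(r'\\.', sample))   # backslash + any char: \d and \w are destroyed too
print(strip_with(r'\\\.', sample))  # backslash + literal dot: only \. becomes .
```

Only the third pattern removes the escape while leaving `\d` and `\w` intact.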
The streaming /api/ai/generate endpoint returns raw AI text without server-side processing, so escaped dots (\.) pass through to the browser. Add sanitizeOsRegex() in app.js that strips \. → . client-side after XML extraction, covering both the streaming and validated endpoints.
Add a hand-curated example with unescaped dots for IP regex (\d+.\d+.\d+.\d+) so the fine-tuned model natively learns correct OS_Regex IP syntax instead of relying on post-processing.
…gex instructions
- Add _stream_ai_sanitized to post-process AI output and fix \d+\.\d+ → \d+.\d+ (common AI mistake: escaping dots before \d for IPs in Wazuh OS_Regex)
- Enhance _sanitize_decoder_xml_osregex to target IP patterns specifically, only removing \. between \d quantifiers, not valid \.+ any-char quantifiers
- Update Modelfile with clearer IP regex instructions and new example conversations
- Update Modelfile.finetune with more training examples (iptables, squid, UFW, TrafficLog, CEF Palo Alto, nginx, SSH, netfilter, KV log)
- Fix kv-log-fields decoder order to match extract_fields

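The IP-specific rule described in this commit (remove `\.` only between `\d` quantifiers and keep `\.+`) can be sketched with a lookahead. The pattern and function name below are assumptions, not the code of `_sanitize_decoder_xml_osregex`.

```python
import re

# A backslash-dot that follows "\d+" and is itself followed by "\d".
ESCAPED_DOT_BETWEEN_DIGITS = re.compile(r'(\\d\+)\\\.(?=\\d)')


def unescape_ip_dots(os_regex):
    """Turn \\d+\\.\\d+ sequences into \\d+.\\d+ while leaving \\.+ untouched."""
    # A function replacement avoids any escape processing of the backslashes.
    return ESCAPED_DOT_BETWEEN_DIGITS.sub(lambda m: m.group(1) + '.', os_regex)
```

Because the lookahead does not consume the following `\d`, consecutive octets in a pattern such as `\d+\.\d+\.\d+\.\d+` are all rewritten in a single pass.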
…ex sanitizer
- Remove programmatic XML from ai_generate and ai_generate_validated; AI now generates from scratch using only analysis context
- Add _fix_osregex_bare_dot_quantifier: converts common AI mistakes (.+) → (\S+), .+ → \.+, .* → \.+ inside regex/prematch tags
- Update _OLLAMA_SYSTEM_PROMPT with explicit anti-pattern examples showing CORRECT vs WRONG OS_Regex patterns
- Strengthen _build_ai_prompt decoder_rules with OS_Regex constraints and anti-echo instructions
- Update Modelfile and Modelfile.finetune with anti-pattern section

…d bare-dot sanitizer
- Remove raw streaming output (#aiOut) from UI; only show final extracted XML
- Add Reference Field-to-Pattern Mapping in _build_ai_prompt: programmatic regex patterns as text guidance (not XML blocks AI can echo)
- Add _infer_osregex_type helper to suggest correct OS_Regex pattern per field
- Add _build_fallback_decoder: silently builds programmatic decoder when AI produces no valid XML (uses user inputs like field_hints)
- Add _fix_osregex_bare_dot_quantifier: sanitizes (.+) → (\S+), .+ → \.+, .* → \.+ inside regex/prematch tags
- Update ai_generate_validated to fall back when all retries fail
- Remove ai-stream-block CSS (unused)

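The three rewrites listed for `_fix_osregex_bare_dot_quantifier` can be sketched as below. Only the mapping (`(.+)` → `(\S+)`, `.+` → `\.+`, `.*` → `\.+`, limited to `regex` and `prematch` tags) comes from the commit message. The regular expressions and function names are assumptions.

```python
import re

# Captures the opening tag, the body and the closing tag of regex/prematch elements.
TAGGED = re.compile(r'(<(regex|prematch)\b[^>]*>)(.*?)(</\2>)', re.DOTALL)
# A dot followed by + or * that is not already escaped with a backslash.
BARE_DOT_QUANTIFIER = re.compile(r'(?<!\\)\.[+*]')


def fix_body(body):
    """Apply the three rewrites to the text inside one tag."""
    body = body.replace('(.+)', r'(\S+)')
    return BARE_DOT_QUANTIFIER.sub(lambda m: r'\.+', body)


def fix_bare_dots(xml):
    """Rewrite bare-dot quantifiers inside <regex> and <prematch> only."""
    return TAGGED.sub(lambda m: m.group(1) + fix_body(m.group(3)) + m.group(4), xml)
```

Text in other elements, such as `<order>`, is left untouched, and an existing `\.+` is not escaped a second time.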
…d-aid sanitization
AI now handles structure (decoder names, hierarchy, order tags) but regex patterns come from the proven programmatic engine. _inject_programmatic_regex matches each <regex> to the next <order> and replaces the content with the correct regex from analysis regex_order_pairs.
- Remove unreliable bare-dot/IP sanitizers for regex content
- _extract_xml_from_ai_response accepts regex_order_pairs param
- ai_generate and ai_generate_validated pass analysis data through

- Remove _inject_programmatic_regex, _build_fallback_decoder
- Remove _FIELD_PATTERN_MAP, _infer_osregex_type
- Remove Reference Field-to-Pattern Mapping from _build_ai_prompt
- _sanitize_decoder_xml_osregex reverts to band-aid fixes only
- _extract_xml_from_ai_response no longer takes regex_order_pairs
- ai_generate and ai_generate_validated have zero programmatic fallback

AI generates everything (structure + regex) independently.

…ript
AI now generates everything (structure + regex) independently. Only band-aid sanitization remains: (.+) → (\S+), .+ → \.+, \d+\.\d+ → \d+.\d+.
Add scripts/train_osregex.py: extracts 26 training pairs from Modelfile.finetune into JSONL format and provides training commands for Ollama 0.5+, Unsloth, llama.cpp, and Axolotl.

…regex correction
- scripts/generate_finetuning_data.py: downloads all 104 decoder + 133 rule XMLs from wazuh-ruleset, generates 806 training pairs (725 train / 81 val) in JSONL format
- app/main.py: _inject_correct_regex silently replaces AI <regex> content with analysis-derived patterns; _INTERNAL_FIELD_REGEX maps field names to correct OS_Regex
- scripts/train_osregex.py: points to new 806-example dataset
- .gitignore: add .cache_decoders/

…refix regex generation
- sanitizeOsRegex disabled: backend _inject_correct_regex already fixes patterns
- build_split_regexes_from_fields: better first-field prefix handling (no \.+ prefix for start-of-log fields); use (\.+) for multi-word/multi-token field values

…ead of preceding words
…on; expand CEF field aliases
- app.js: checkExistingDecoder() calls /api/analyze first and shows confirm() dialog if a builtin decoder already matches
- main.py: add 'source', 'destination', 'port' aliases for CEF field mapping

…gex token patterns
- AI prompt: provide correct prematch when no program_name is pre-decoded
- AI prompt: add both <program_name> and <prematch> strategy examples
- decoder_ml_enhanced.py: fix over-escaped regex tokens in enhanced tokenizer
- wazuh_logtest.py: use env vars with fallback defaults instead of hardcoded values
- .gitignore: exclude certs/, *.pem, *.key, *.crt
- README.md: make SSH config docs generic

Summary
This PR adds a new integration named `wazuh_decoder_rule_tool`, a FastAPI-based tool for analyzing logs, checking existing Wazuh decoder/rule matches through `wazuh-logtest`, and generating custom decoder and rule XML.

New Features
AI-Powered Generation (Hybrid Approach)
… `<program_name>` when available, `<prematch>` otherwise)

Enhanced ML Decoder Similarity

Improved Decoder Generation

Robustness & Reliability
… `\.` vs `.` semantics)

Included

Testing
The app can be tested locally:
mkdir -p certs
openssl req -x509 -newkey rsa:4096 -keyout certs/localhost.key -out certs/localhost.crt -days 365 -nodes -subj "/CN=localhost"
Access the application via https://localhost:8443.

Connecting to Wazuh VM for wazuh-logtest

Example Scenario
May 19 12:34:56 custom-server myapp[1234]: User 'admin' failed to authenticate from IP 192.168.1.100 due to invalid_password
… `user`, `srcip`)